Abstract

Do two data samples come from different distributions? Recent studies of this fundamental problem focused on 
embedding probability distributions into sufficiently rich characteristic Reproducing Kernel Hilbert Spaces (RKHSs), 
to compare distributions by the distance between their embeddings. We show that Regularized Maximum Mean 
Discrepancy (RMMD), our novel measure for kernel-based hypothesis testing, yields substantial improvements even 
when sample sizes are small, and excels at hypothesis tests involving multiple comparisons with power control. We 
derive asymptotic distributions under the null and alternative hypotheses, and assess power control. Outstanding 
results are obtained on: challenging EEG data, MNIST, the Berkley Covertype, and the Flare-Solar dataset. 

I. Introduction

Homogeneity testing is an important problem in statistics and machine learning. It tests whether two
samples are drawn from different distributions. This is relevant for many applications, for instance, schema
matching in databases [9] and speaker identification [13]. Popular two-sample tests like Kolmogorov-Smirnov
[2] and Cramer-von-Mises [?] are not capable of capturing statistical information of densities
with high frequency features. Non-parametric kernel-based statistical tests such as Maximum Mean
Discrepancy (MMD) [?], [?] enable one to obtain greater power than such density based methods. MMD is
applicable not only to Euclidean spaces $\mathbb{R}^n$, but also to groups and semigroups [?], and to structures such
as strings or graphs in bioinformatics, robotics problems, etc. [?]. Here we consider a regularized
version of MMD to address hypothesis testing.

With more than two distributions to be compared simultaneously, we face the multiple comparisons
setting, for which statistical methods exist to deal with the issue of multiple test correction [23]. Given
a prescribed global significance threshold $\alpha$ (type I error) for the set of all comparisons, however, the
corresponding threshold per comparison becomes small, which greatly reduces the power of the test. In
situations where one wants to retain the null hypothesis, tests with small $\alpha$ are not conservative. Our main
contribution is the definition of a regularized MMD (RMMD) method.

The regularization term in RMMD allows us to control the power of the test statistic. The regularizer is
set provably optimal for maximal power; there is no need for fine-tuning by the user. RMMD improves 
on MMD through higher power, especially for small sample sizes, while preserving the advantages of 
MMD. Power control enables us to look for true sets of null distributions among the significant ones in 
challenging multiple comparison tasks. 

We provide experimental evidence of good performance on a challenging Electroencephalography (EEG)
dataset, artificially generated periodic and Gaussian data, and the MNIST and Covertype datasets. We also 
assess power control with the Asymptotic Relative Efficiency (ARE) test. 

The paper is organized as follows. In section 2, we elaborate on hypothesis testing and define maximum 
mean discrepancy (MMD) as a metric. We describe how to use MMD for homogeneity testing, and how 
to extend it to multiple comparisons. In section 3, we define RMMD for hypothesis testing and compare 
it to MMD and Kernel Fisher Discriminant Analysis (KFDA), and assess power control through ARE. 
Additional empirical justification of our test on various datasets is presented in section 4. 



II. Statistical Hypothesis Testing 

A statistical hypothesis test is a method which, based on experimental data, aims to decide whether
a hypothesis (called the null hypothesis, $H_0$) is true or false, against an alternative hypothesis ($H_1$). The level of
significance $\alpha$ of the test represents the probability of rejecting $H_0$ under the assumption that $H_0$ is true
(type I error). A type II error ($\beta$) occurs when we reject $H_1$ although it holds.

The power of the statistical test is usually defined as $1 - \beta$. A desirable property of a statistical test is
that for a prescribed global significance level $\alpha$ the power equals one in the population limit. We divide
the discussion of hypothesis testing into two topics: homogeneity testing and multiple comparisons.

A. Maximum Mean Discrepancy (MMD) 

Embedding probability distributions into Reproducing Kernel Hilbert Spaces (RKHSs) yields a linear
method that takes information of higher order statistics into account [?], [20], [21]. Characteristic kernels
[6], [21], [?] injectively map the probability distribution onto its mean element in the corresponding
RKHS. The distance between the mean elements ($\mu$) in the RKHS is known as MMD [?], [?]. The
definition of MMD [9] is given in the following theorem:

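Theorem 1. Let $(\mathcal{X}, \mathcal{B})$ be a metric space, and let $P$, $Q$ be two Borel probability measures defined on
$\mathcal{X}$. The kernel function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ embeds the points $x \in \mathcal{X}$ into the corresponding reproducing
kernel Hilbert space $\mathcal{H}$. Then $P = Q$ if and only if $\mathrm{MMD}(P, Q) = 0$, where

$$
\begin{aligned}
\mathrm{MMD}(P, Q) &:= \|\mu_P - \mu_Q\|_{\mathcal{H}} \\
&= \|E_P[k(x, \cdot)] - E_Q[k(y, \cdot)]\|_{\mathcal{H}} \\
&= \left( E_{x, x' \sim P}[k(x, x')] + E_{y, y' \sim Q}[k(y, y')] - 2 E_{x \sim P, y \sim Q}[k(x, y)] \right)^{1/2}. \qquad (1)
\end{aligned}
$$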
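
Eq. (1) suggests a direct plug-in estimate: each expectation is replaced by a sample average of kernel evaluations. The following is a minimal Python sketch of that idea, assuming a Gaussian kernel with a hypothetical `bandwidth` parameter and the biased (V-statistic) form of the estimate; the function names are illustrative and are not taken from the paper.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors (assumed kernel choice)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mean_kernel(xs, ys, kernel):
    """Average of kernel(x, y) over all pairs with x in xs and y in ys."""
    return sum(kernel(x, y) for x in xs for y in ys) / (len(xs) * len(ys))


def mmd_squared(xs, ys, kernel=gaussian_kernel):
    """Biased plug-in estimate of MMD(P, Q)^2, i.e. the square of eq. (1)."""
    return (mean_kernel(xs, xs, kernel)
            + mean_kernel(ys, ys, kernel)
            - 2.0 * mean_kernel(xs, ys, kernel))


# Tiny illustrative samples (made up for this sketch).
same = [(0.0,), (1.0,), (2.0,)]
shifted = [(5.0,), (6.0,), (7.0,)]
print(mmd_squared(same, same))     # identical samples give zero
print(mmd_squared(same, shifted))  # well-separated samples give a clearly positive value
```

The estimate is zero when both samples coincide and grows as the samples separate, which is the behaviour Theorem 1 describes at the population level.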

B. Homogeneity Testing 

A two-sample test investigates whether two samples are generated by the same distribution. For this
purpose, MMD can be used to measure the distance between embedded probability distributions in the RKHS.
Besides calculating the distance measure, we need to check whether this distance is significantly different
from zero. For this, the asymptotic distribution of this distance measure is used to obtain a threshold on
MMD values, and to extract the statistically significant cases. We perform a hypothesis test with null
hypothesis $H_0 : P = Q$ and alternative $H_1 : P \neq Q$ on samples drawn from two distributions $P$ and $Q$.
If the result of MMD is close enough to zero, we accept $H_0$, which indicates that the distributions $P$
and $Q$ coincide; otherwise the alternative is assumed to hold. With $\alpha$ as the significance level, the
threshold is the $(1 - \alpha)$-quantile of the asymptotic distribution of the empirical MMD under $P = Q$;
MMD values above this quantile are statistically significant. Our MMD test determines it by means of a
bootstrap procedure.
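
The thresholding step described above can be sketched with a resampling procedure. The paper uses a bootstrap; the sketch below swaps in a closely related permutation scheme (pooling both samples and re-splitting them at random) because it is short to write down. The Gaussian kernel, its bandwidth, the number of permutations, and all function names are assumptions made for illustration.

```python
import math
import random


def gaussian_gram(points, bandwidth=1.0):
    """Gram matrix of an (assumed) Gaussian kernel over the pooled sample."""
    n = len(points)
    gram = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            sq = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            gram[i][j] = gram[j][i] = math.exp(-sq / (2.0 * bandwidth ** 2))
    return gram


def mmd_squared_from_gram(gram, idx_x, idx_y):
    """Biased MMD^2 estimate for the split (idx_x, idx_y) of the pooled sample."""
    def block_mean(rows, cols):
        return sum(gram[i][j] for i in rows for j in cols) / (len(rows) * len(cols))
    return (block_mean(idx_x, idx_x) + block_mean(idx_y, idx_y)
            - 2.0 * block_mean(idx_x, idx_y))


def mmd_permutation_test(xs, ys, alpha=0.05, n_permutations=200, seed=0):
    """Return (observed statistic, (1 - alpha) null quantile, reject H0?)."""
    rng = random.Random(seed)
    pooled = list(xs) + list(ys)
    gram = gaussian_gram(pooled)
    n_x = len(xs)
    indices = list(range(len(pooled)))
    observed = mmd_squared_from_gram(gram, indices[:n_x], indices[n_x:])
    null_stats = []
    for _ in range(n_permutations):
        rng.shuffle(indices)  # random re-split simulates H0: P = Q
        null_stats.append(mmd_squared_from_gram(gram, indices[:n_x], indices[n_x:]))
    null_stats.sort()
    k = math.ceil((1.0 - alpha) * n_permutations) - 1
    threshold = null_stats[max(0, min(n_permutations - 1, k))]
    return observed, threshold, observed > threshold


# Illustrative data: two 1-D Gaussians with clearly different means.
data_rng = random.Random(1)
xs = [(data_rng.gauss(0.0, 1.0),) for _ in range(20)]
ys = [(data_rng.gauss(3.0, 1.0),) for _ in range(20)]
observed, threshold, reject = mmd_permutation_test(xs, ys)
print(observed, threshold, reject)
```

Precomputing the Gram matrix once keeps each resampled statistic an index lookup, so the cost per permutation stays quadratic in the pooled sample size.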

C. Multiple Comparisons

Statistical analysis of a data set typically requires testing many hypotheses. The multiple comparisons
or multiple testing problem arises when we evaluate several statistical hypotheses simultaneously. Let
$\alpha$ be the overall type I error, and let $\alpha'$ denote the type I error of a single comparison in the multiple
testing scenario. Maintaining the prescribed overall significance level $\alpha$ in multiple comparisons requires $\alpha'$ to be
more stringent than $\alpha$. Nevertheless, in many studies $\alpha' = \alpha$ is used without correction. Several statistical
techniques have been developed to control $\alpha$ [23]. We use the Dunn-Sidak method: for $n$ independent
comparisons in multiple testing, the per-comparison significance level is obtained by $\alpha' = 1 - (1 - \alpha)^{1/n}$.
As $\alpha'$ decreases, the probability of type II error ($\beta$) increases and the power of the test decreases. This
requires controlling $\beta$ while correcting $\alpha$. To tackle this problem, and to control $\beta$, we define a new hypothesis test based
on RMMD, which has higher power than the MMD-based test, in the next section. To compare the
distributions in the multiple testing problem we use two approaches: one-vs-all and pairwise comparisons.
In the one-vs-all case each distribution is compared to all other distributions in the family, thus $M$
distributions require $M - 1$ comparisons. In the pairwise case each pair of distributions is compared at
the cost of $M(M-1)/2$ comparisons.
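
As a small worked example of the Dunn-Sidak correction, the per-comparison level for $n$ independent comparisons at family-wise level $\alpha$ is $1 - (1 - \alpha)^{1/n}$, and the pairwise approach over $M$ distributions uses $n = M(M-1)/2$. The sketch below evaluates this for the class counts quoted in Section IV.B; the function names are illustrative.

```python
def sidak_per_comparison_level(family_alpha, n_comparisons):
    """Dunn-Sidak per-comparison significance level for independent comparisons."""
    return 1.0 - (1.0 - family_alpha) ** (1.0 / n_comparisons)


def pairwise_comparisons(n_distributions):
    """Number of distinct pairs among n_distributions distributions."""
    return n_distributions * (n_distributions - 1) // 2


for name, n_classes in [("MNIST", 10), ("Covertype", 7), ("Flare-Solar", 2)]:
    n = pairwise_comparisons(n_classes)
    level = sidak_per_comparison_level(0.05, n)
    print(name, n, round(level, 4))
# prints:
# MNIST 45 0.0011
# Covertype 21 0.0024
# Flare-Solar 1 0.05
```

These values match the per-comparison levels reported for the three benchmark datasets in the experiments.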



III. Regularized Maximum Mean Discrepancy (RMMD)

The main contribution of this paper is a novel regularization of the MMD measure, called RMMD. This
regularization aims to provide a test statistic with greater power (power closer to 1 for a prescribed
type I error $\alpha$). Erdogmus and Principe [?] showed that $-\log \|\mu_P\|_{\mathcal{H}}^2$ is the Parzen window estimate
of the Renyi entropy [16]. With RMMD we obtain a statistical test with greater power by penalizing the
term $\|\mu_P\|_{\mathcal{H}}^2 + \|\mu_Q\|_{\mathcal{H}}^2$. We formulate RMMD and its empirical estimator as follows:

$$
\mathrm{RMMD}(P, Q) := \mathrm{MMD}(P, Q)^2 - \kappa_P \|\mu_P\|_{\mathcal{H}}^2 - \kappa_Q \|\mu_Q\|_{\mathcal{H}}^2 \qquad (2)
$$

$$
\widehat{\mathrm{RMMD}}(P, Q) := \|\hat{\mu}_P - \hat{\mu}_Q\|_{\mathcal{H}}^2 - \kappa_P \|\hat{\mu}_P\|_{\mathcal{H}}^2 - \kappa_Q \|\hat{\mu}_Q\|_{\mathcal{H}}^2 \qquad (3)
$$

where $\kappa_P$ and $\kappa_Q$ are non-negative regularization constants. For simplicity we consider $\kappa_P = \kappa_Q = \kappa$
in many applications; however, we can introduce prior knowledge about the complexity of the distributions by
choosing $\kappa_P \neq \kappa_Q$. The modified Jensen-Shannon divergence (JS) [3] corresponding to RMMD is defined
as:

$$
D(P, Q) := H_S(P, Q) - (\kappa + 1)\,(H_S(P) + H_S(Q)) \qquad (4)
$$

where $H_S$ denotes the (cross) entropy. Since $\kappa$ is positive, the absolute value of the second term on the
right-hand side of eq. (4) increases, leading to a higher weight for the mutual information than for the entropy
(and vice versa if $\kappa$ were lower than $-1$).^1

Here we summarize the notation needed in the next section. Given samples $\{x_i\}_{i=1}^{n_1}$ and $\{y_i\}_{i=1}^{n_2}$
drawn from distributions $P$ and $Q$, respectively, the mean element, the cross-covariance operator and
the covariance operator are defined as follows [?], [9]:

$$
\hat{\mu}_P = \frac{1}{n_1} \sum_{i=1}^{n_1} k(x_i, \cdot), \qquad
\hat{\Sigma}_{PQ} = \frac{n_1 n_2}{n_1 + n_2}\, (\hat{\mu}_P - \hat{\mu}_Q) \otimes (\hat{\mu}_P - \hat{\mu}_Q), \qquad
\hat{\Sigma}_P = \frac{1}{n_1} \sum_{i=1}^{n_1} \big(k(x_i, \cdot) \otimes k(x_i, \cdot)\big) - (\hat{\mu}_P \otimes \hat{\mu}_P),
$$

where $u \otimes v$ for $u, v \in \mathcal{H}$ is defined for all $f \in \mathcal{H}$ as $(u \otimes v) f = \langle v, f \rangle_{\mathcal{H}}\, u$. The quantities
$\hat{\mu}_Q$ and $\hat{\Sigma}_Q$ are defined analogously for the second sample $\{y_i\}_{i=1}^{n_2}$. The population counterparts,
i.e., the population mean element and the population covariance operator, are defined for any probability
measure $P$ as $\langle \mu_P, f \rangle_{\mathcal{H}} = E[f(x)]$ for all $f \in \mathcal{H}$, and
$\langle f, \Sigma_P g \rangle_{\mathcal{H}} = \mathrm{cov}_P[f(x), g(x)]$ for $f, g \in \mathcal{H}$. From now on we call $\Sigma_B = \Sigma_{PQ}$ the
between-distribution covariance. The pooled covariance operator (which we also call the within-distribution
covariance) is denoted by:

$$
\Sigma_W = \frac{n_1}{n_1 + n_2}\, \Sigma_P + \frac{n_2}{n_1 + n_2}\, \Sigma_Q .
$$

^1 RMMD with negative-valued $\kappa$ can be used in clustering as a divergence to compare clusters. We achieve
greater entropy with broader clusters. The resulting clustering method avoids overfitting with narrow clusters.
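
The empirical estimator in eq. (3) can be written with three averaged kernel blocks, since $\|\hat{\mu}_P\|^2$ is the mean of $k(x_i, x_j)$ over the first sample, and similarly for $Q$. A minimal Python sketch follows, assuming a Gaussian kernel with a hypothetical `bandwidth` parameter and the biased (V-statistic) averages; names and defaults are illustrative only.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors (assumed kernel choice)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mean_kernel(xs, ys, kernel):
    """Average of kernel(x, y) over all pairs with x in xs and y in ys."""
    return sum(kernel(x, y) for x in xs for y in ys) / (len(xs) * len(ys))


def rmmd_estimate(xs, ys, kappa_p=1.0, kappa_q=1.0, kernel=gaussian_kernel):
    """Plug-in version of eq. (3): MMD^2 minus the two penalized squared norms."""
    k_xx = mean_kernel(xs, xs, kernel)  # estimates ||mu_P||^2
    k_yy = mean_kernel(ys, ys, kernel)  # estimates ||mu_Q||^2
    k_xy = mean_kernel(xs, ys, kernel)  # estimates <mu_P, mu_Q>
    mmd_sq = k_xx + k_yy - 2.0 * k_xy
    return mmd_sq - kappa_p * k_xx - kappa_q * k_yy


# Tiny illustrative samples (made up for this sketch).
xs = [(0.0,), (1.0,), (2.0,)]
ys = [(0.5,), (1.5,), (4.0,)]
print(rmmd_estimate(xs, ys, kappa_p=0.0, kappa_q=0.0))  # reduces to the MMD^2 estimate
print(rmmd_estimate(xs, ys))                            # kappa = 1: only -2 <mu_P, mu_Q> remains
```

With $\kappa_P = \kappa_Q = 0$ the expression is the plain MMD$^2$ estimate, and with $\kappa_P = \kappa_Q = 1$ the within-sample terms cancel and only the cross term is left.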

A. Limit Distribution of RMMD Under Null and Fixed Alternative Hypotheses

Now we derive the distribution of the test statistic under the null hypothesis of homogeneity $H_0 : P = Q$
(Theorem 2), which implies $\mu_P = \mu_Q$ and $\Sigma_P = \Sigma_Q = \Sigma_W$. Consistency of the test is guaranteed
by the form of the distribution under $H_1 : P \neq Q$ (Theorem 2). Assume that $\{x_i\}_{i=1}^{n_1}$ and $\{y_i\}_{i=1}^{n_2}$ are
independent samples from $P$ and $Q$, respectively (a priori they are not equally distributed). Let $z_i := (x_i, y_i)$,
$h(z_i, z_j) := k(x_i, x_j) + k(y_i, y_j) - k(x_i, y_j) - k(x_j, y_i) - h'(z_i, z_j)$, $h'(z_i, z_j) = \kappa_P k(x_i, x_j) + \kappa_Q k(y_i, y_j)$,
and let $\xrightarrow{D}$ denote convergence in distribution. Without loss of generality we assume $n_1 = n_2 = n$ and
$\kappa_P = \kappa_Q = \kappa$. The proofs hold even when $\kappa_P \neq \kappa_Q$. Based on Hoeffding [14], and Theorem A (p. 192) and
Theorem B (p. 193) by Serfling [19], we can prove the following theorem:

Theorem 2. If $E[h^2] < \infty$, under $H_1$, RMMD is asymptotically normally distributed,

$$
\sqrt{n}\, \big(\widehat{\mathrm{RMMD}} - \mathrm{RMMD}\big) \xrightarrow{D} \mathcal{N}(0, \sigma^2),
$$

with variance $\sigma^2 = 4\big(E_z[E_{z'}[h(z, z')^2]] - E_{z,z'}^2[h(z, z')]\big)$, uniformly at rate $1/\sqrt{n}$. Under $H_0$, the same
convergence holds with $\sigma^2 = 4\big(E_z[E_{z'}[h'(z, z')^2]] - E_{z,z'}^2[h'(z, z')]\big) > 0$.

To increase the power of our RMMD-based test we need to decrease the variance under $H_1$ in
Theorem 2. The following theorem can be used to obtain maximal power by setting $\kappa = 1$. This gives
us a fixed hyper-parameter, with no need for user tuning. The optimal value of $\kappa$ decreases the variance
under both $H_1$ and $H_0$ simultaneously, and the fixed $\alpha$ is defined over the changed variance under $H_0$.



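Theorem 3. The highest power of RMMD is obtained for $\kappa = 1$.

Proof. Let $A = k(x_i, x_j) + k(y_i, y_j)$ and $B = k(x_i, y_j) + k(x_j, y_i)$, so that $h = (1 - \kappa) A - B$. Based on
Theorem 2, the variance under $H_1$ is obtained by:

$$
\begin{aligned}
\sigma^2 &= 4\big(E_z[E_{z'}[h(z, z')^2]] - E_{z,z'}^2[h(z, z')]\big) \\
&= 4\big(E[((1 - \kappa) A - B)^2] - E^2[(1 - \kappa) A - B]\big) \\
&= 4\big((1 - \kappa)^2 (E[A^2] - E^2[A]) + E[B^2] - E^2[B]\big) \\
&= 4\big((1 - \kappa)^2\, \mathrm{var}(A) + \mathrm{var}(B)\big), \qquad (5)
\end{aligned}
$$

where $\mathrm{var}(A)$ and $\mathrm{var}(B)$ denote the variances. To get maximal power, we set

$$
\frac{\partial \big((1 - \kappa)^2\, \mathrm{var}(A) + \mathrm{var}(B)\big)}{\partial \kappa} = 0, \qquad (6)
$$

which yields $\kappa = 1$.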
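
As a worked step, the stationarity condition in eq. (6) can be expanded explicitly. The block below is a sketch that takes the variance expression of eq. (5) as given, i.e. it assumes, as the proof does, that the variance splits into a $\mathrm{var}(A)$ term and a $\mathrm{var}(B)$ term with no cross term.

```latex
\frac{\partial}{\partial \kappa}\Big((1-\kappa)^2\,\mathrm{var}(A) + \mathrm{var}(B)\Big)
  = -2\,(1-\kappa)\,\mathrm{var}(A) = 0
  \;\Longrightarrow\; \kappa = 1 \quad \text{(for } \mathrm{var}(A) > 0\text{)},
\qquad
\frac{\partial^2}{\partial \kappa^2}\Big((1-\kappa)^2\,\mathrm{var}(A) + \mathrm{var}(B)\Big)
  = 2\,\mathrm{var}(A) \;\ge\; 0 .
```

The non-negative second derivative indicates that $\kappa = 1$ is a minimizer of the variance expression, which at that point reduces to $4\,\mathrm{var}(B)$.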



B. Comparison between RMMD, MMD, and KFDA 

According to Theorem 8 by Gretton et al. [9], under the null hypothesis the test statistic of MMD
degenerates. This corresponds to $\sigma^2 = 0$ in our Theorem 2. For large sample sizes the null distribution
of MMD converges in distribution to an infinite weighted sum of independent $\chi_1^2$ random variables,
with weights equal to the eigenvalues of the within-distribution covariance operator $\Sigma_W$. If we denote
the test statistic based on MMD by $\hat{T}_n^{\mathrm{MMD}}$, then $\hat{T}_n^{\mathrm{MMD}} \xrightarrow{D} C \sum_{i=1}^{\infty} \lambda_i (z_i^2 - 2)$, where $z_i \sim \mathcal{N}(0, 2)$
are i.i.d. random variables, and $C$ is a scaling factor. Harchaoui et al. [13] introduced Kernel Fisher
Discriminant Analysis (KFDA) as a homogeneity test by regularizing MMD with the within-distribution
covariance operator. The maximum Fisher discriminant ratio defines this test statistic. The empirical KFDA
test statistic is denoted as $\mathrm{KFDA}(P, Q) = \frac{n_1 n_2}{n} \big\| (\hat{\Sigma}_W + \gamma_n I)^{-1/2} (\hat{\mu}_P - \hat{\mu}_Q) \big\|_{\mathcal{H}}^2$. To analyze the asymptotic behaviour
of this statistic under the null hypothesis, Harchaoui et al. [13] consider two situations regarding the
regularization parameter $\gamma_n$: 1) one where $\gamma_n$ is held fixed, obtaining a limit distribution similar to that of MMD
under $H_0$; 2) one where $\gamma_n$ tends to zero slower than $n^{-1/2}$. In the first situation the test statistic converges
to $\hat{T}_n^{\mathrm{KFDA}}(\gamma_n) \xrightarrow{D} C \sum_{i=1}^{\infty} (\lambda_i + \gamma_n)^{-1} \lambda_i (z_i^2 - 1)$. Thus, the test statistic based on KFDA normalizes the
weights of the $\chi_1^2$ random variables by using the covariance operator as the regularizer. In comparison, MMD
is more sensitive to the information of higher order moments because of their bigger weights (larger
eigenvalues of the covariance operator). In the second situation (applicable in practice only for very large
sample sizes) the test statistic converges to $\hat{T}_n^{\mathrm{KFDA}}(\gamma_n) \xrightarrow{D} \mathcal{N}(C, 1)$, where $C$ is a constant.

The asymptotic convergence of the test statistic based on RMMD is $\hat{T}_n^{\mathrm{RMMD}} \xrightarrow{D} \mathcal{N}(0, \sigma^2)$, where $\sigma^2$
is the variance of the function $h$ in Theorem 2. The precise analytical normal distribution obtains higher
power in RMMD. Because of the degeneracy ($\sigma^2 = 0$ in the asymptotic distribution) for MMD and KFDA,
they use an estimate of the distribution under the null hypothesis, which loses accuracy and affects
the power. In contrast to MMD and KFDA, RMMD is consistent since the degeneracy under the null
hypothesis no longer occurs. RMMD is the generalized form of the test statistic based on
MMD, which we obtain for $\kappa = 0$. Moreover, by minimizing the variance of the normal distribution, we
obtain the best power for $\kappa = 1$, and thus the hyper-parameter $\kappa$ is fixed without requiring tuning by the
user.

In comparison to KFDA, RMMD does not require restrictive constraints to obtain high power. It also
results in higher power than MMD and KFDA in cases with small sample size. The speed of power
convergence in KFDA is $O_P(1)$, which is slower than $O_P(n^{-1/2})$ in RMMD when $n \to \infty$.

Regarding the computational complexity, for MMD a parametric model with lower order moments of
the test statistic is used to estimate the value of MMD, which degenerates under $H_0$ and which has no
consistency or accuracy guarantee. In comparison, bootstrap resampling and the eigen-spectrum of the
Gram matrix are more consistent estimates, with a computational cost of $O(n^2)$, where $n$ is the number of
samples [?]. For RMMD, the convergence of the test statistic to a normal distribution enables a fast,
consistent and straightforward estimation of the null distribution within $O(n^2)$ time, without the need for
a separate estimation method. The results of the power comparison between these tests are reported in section 4.

C. Asymptotic Relative Efficiency of Statistical Tests 

To assess the power control we use the asymptotic relative efficiency. This criterion shows that RMMD
is a better test statistic and obtains higher power than KFDA and MMD with smaller sample sizes.
Relative efficiency enables one to select the most effective statistical test quantitatively [15]. Let $T$ and $V$
be test statistics to be compared. The necessary sample size for the test statistic $T$ to achieve the power
$1 - \beta$ at significance level $\alpha$ is denoted by $N_T(\alpha, 1 - \beta)$. The relative efficiency of the statistical
test $T$ with respect to the statistical test $V$ is given by:

$$
e_{T,V}(\alpha, 1 - \beta) = N_V(\alpha, 1 - \beta) / N_T(\alpha, 1 - \beta). \qquad (7)
$$

Since calculating $N_T(\alpha, 1 - \beta)$ is hard even for the simplest test statistics, the limit value of $e_{T,V}(\alpha, 1 - \beta)$
as $1 - \beta \to 1$ is used. The limiting value is called the Bahadur Asymptotic Relative Efficiency (ARE),
denoted by $e_{T,V}$:

$$
e_{T,V} := \lim_{1 - \beta \to 1} e_{T,V}(\alpha, 1 - \beta). \qquad (8)
$$

The test statistic $V$ is considered better than $T$ if $e_{T,V}$ is smaller than 1, because it means that $V$ needs a
lower sample size to obtain a power of $1 - \beta$ for the given $\alpha$. In [13], the authors assessed the power control by
means of an analysis of local alternatives, which works when the sample size is very large or when $n$ tends
to infinity. In this article, we focus our attention on the small sample size case, which is more challenging.
In section 4, we compute $e_{\mathrm{MMD,RMMD}} = N_{\mathrm{RMMD}} / N_{\mathrm{MMD}}$, $e_{\mathrm{MMD,KFDA}} = N_{\mathrm{KFDA}} / N_{\mathrm{MMD}}$, and $e_{\mathrm{KFDA,RMMD}} = N_{\mathrm{RMMD}} / N_{\mathrm{KFDA}}$,
using artificial datasets and two types of kernels, and we obtain smaller ARE for RMMD than for
KFDA and MMD. This means RMMD gives higher power with a much smaller sample size. Results for
different data sets are reported in Table 2, Figure 2, and Figure 3.
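
Eq. (7) can be evaluated empirically once power has been estimated at several sample sizes for each test. The sketch below is one hedged way to do this: it takes, for each test, a mapping from sample size to estimated power, finds the smallest tabulated sample size that reaches a target power, and forms the ratio $N_V / N_T$. The function names and the power values are hypothetical and are not results from the paper.

```python
def required_sample_size(power_by_n, target_power):
    """Smallest tabulated sample size whose estimated power reaches target_power."""
    for n in sorted(power_by_n):
        if power_by_n[n] >= target_power:
            return n
    return None  # target power not reached on the tabulated grid


def relative_efficiency(power_t, power_v, target_power):
    """Empirical e_{T,V} = N_V / N_T as in eq. (7); None if either test misses the target."""
    n_t = required_sample_size(power_t, target_power)
    n_v = required_sample_size(power_v, target_power)
    if n_t is None or n_v is None:
        return None
    return n_v / n_t


# Hypothetical power curves (illustrative numbers only).
power_t = {50: 0.40, 100: 0.70, 200: 0.95}
power_v = {50: 0.60, 100: 0.96, 200: 1.00}
print(relative_efficiency(power_t, power_v, target_power=0.95))  # 0.5
```

A value below 1 means that the second test $V$ reaches the target power with fewer samples than $T$, which is the reading used for the ARE values in Table 2.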



IV. Experiments 

MMD [9] was experimentally shown to outperform many traditional two-sample tests such as the
generalized Wald-Wolfowitz test, the generalized Kolmogorov-Smirnov (KS) test [2], the Hall-Tajvidi
(Hall) test [12], and the Biau-Gyorfi test. It was shown [13] that KFDA outperforms the Hall-Tajvidi
test. We select KS and Hall as traditional baseline methods, against which we compare RMMD, KFDA,
and MMD. To experimentally evaluate the utility of the proposed hypothesis testing method, we present
results on various artificial and real-world benchmark datasets.



A. Artificial Benchmarks with Periodic and Gaussian Distributions 

Our proposed method can be used for testing the homogeneity of structured data, which is an
advantage over traditional two-sample tests. We artificially generated distributions from Locally Compact
Abelian Groups (periodic data) and applied our RMMD-test to decide whether the samples come from
the same distributions or not. Suppose the first sample is drawn from a uniform distribution $P$ on
the unit interval. The other sample is drawn from a perturbed uniform distribution $Q_\omega$ with density
$1 + \sin(\omega x)$. For higher perturbation frequencies $\omega$ it becomes harder to discriminate $Q_\omega$ from $P$. Since
the distributions have a periodic nature, we use a characteristic kernel tailored to the periodic domain,
$k(x, y) = \cosh(\pi - (x - y) \bmod 2\pi)$. For 200 samples from each distribution, the type II error is computed
by comparing the prediction to the ground truth over 1000 repetitions. We average the results over 10 runs.
The significance level is set to $\alpha = 0.05$. We perform the same experiment with MMD, KFDA, KS and
Hall. The powers of the homogeneity test for comparing $P$ and $Q_\omega$ with the above mentioned methods
are reported in Table 1 as Periodic1. The best power is achieved by RMMD, and as expected, the results
of kernel methods are better than traditional ones.

Since the selection of the kernel is a critical choice in kernel-based methods, we also investigated the
usage of a different kernel and replaced the previous kernel with $k(x, y) = -\log(1 - 2\theta \cos(x - y) + \theta^2)$,
where $\theta$ is a hyperparameter. We report the best results, achieved with $\theta = 0.9$, as Periodic2 in Table 1. The
reader is referred to [4], [11] for a detailed study on these kernels.

We also report the results on the toy problem of comparing two 25-dimensional Gaussian distributions
with 250 samples, both with zero mean vector but with covariance matrices $1.5\,I$ and $1.8\,I$, respectively.
This dataset is referred to as Gaussian in Table 1.

TABLE I

The power obtained on the periodic data, the Gaussian, the MNIST, Covertype, and Flare-Solar datasets, by
applying RMMD with $\kappa = 0.8$ for the periodic data and $\kappa = 1$ for the others, and KFDA with $\gamma = 10^{-1}$.

              RMMD          KFDA          MMD           KS            Hall
Periodic1     0.40 ± 0.02   0.24 ± 0.01   0.23 ± 0.02   0.11 ± 0.02   0.19 ± 0.04
Periodic2     0.83 ± 0.03   0.66 ± 0.05   0.56 ± 0.05   0.11 ± 0.02   0.19 ± 0.04
Gaussian      1.00          0.89 ± 0.03   0.88 ± 0.03   0.04 ± 0.02   1.00
MNIST         0.99 ± 0.01   0.97 ± 0.01   0.95 ± 0.01   0.12 ± 0.04   0.77 ± 0.04
Covertype     1.00          1.00          1.00          0.98 ± 0.02   0.00
Flare-Solar   0.93          0.91          0.89          0.00          0.00

An investigation of the effect of kernel selection and tuning parameters [22] showed that the best results
for MMD can be achieved by those kernels and parameters that attain the supremum value of MMD. Our
reported results agree. The results of kernel-based test statistics (RMMD, KFDA, and MMD) are improved
by kernel selection and parameter tuning, and in all cases RMMD outperforms KFDA and MMD. For
instance, the result of the periodic kernel with tuned hyper-parameter $\theta$ is better than that of the first
periodic kernel without hyper-parameter (reported in Table 1 as Periodic2 and Periodic1, respectively).
For datasets processed with a Gaussian kernel, setting the bandwidth to the median distance between data
points provided the best results. We used a 5-fold cross validation procedure to tune the parameters in our experiment.

The effect of changing $\kappa$ on the power is simulated in two tests: first, by testing the similarity between
the uniform distribution and $Q_4$, and second with $Q_6$. In both cases, the best power is obtained for $\kappa = 0.8$.
The results slightly differ from the theoretical value ($\kappa = 1$) because of the relatively small sample sizes
($n_1 = n_2 = 200$) used for the tests. For samples with larger sizes we obtained maximal power with $\kappa = 1$.
The results are depicted in Figure 1.



Fig. 1. Effect of $\kappa$ on the power of the test. The alternatives are $Q_6$ in the left and $Q_4$ in the right figure.



To ensure that the statistical test is not overly aggressive in rejecting the null hypothesis, we report the
type I error for RMMD, KFDA, and MMD with different sample sizes in Figure 2. Both samples
are drawn from $Q_6$. We used a Gaussian kernel with bandwidth equal to the median distance
between data points. The results are averaged over 100 runs, and the confidence intervals are obtained from 10 replicates.
RMMD obtains zero type I error with smaller sample sizes, and the results of KFDA and MMD are
comparable.

To assess the power control of the test statistics we also compared $e_{\mathrm{MMD,RMMD}}$, $e_{\mathrm{MMD,KFDA}}$, and
$e_{\mathrm{KFDA,RMMD}}$ under $H_1$ when $P$ is a uniform distribution and the alternative is $Q_6$. We obtained smaller
ARE for RMMD than for KFDA and MMD. This means RMMD gives higher power with fewer
samples. Table 2 shows the results, averaged over 1000 runs, for periodic data (Periodic1 and Periodic2).
Figure 3 depicts the detailed results of the type II error for RMMD, MMD, and KFDA based on different
sample sizes $n$. AREs are also calculated for more complex tasks. Consider a first sample drawn
from a uniform distribution $P$ on the unit area. The other sample is drawn from the perturbed uniform
distribution with density $1 + \sin(\omega x)\sin(\omega y)$. For increasing values of $\omega$, the discrimination of the
perturbed distribution from $P$ becomes harder (Figure 4). The frequency $\omega$ ranges from 1 to 6. We call these problems
Puni1 to Puni6, respectively. The best results for all statistical kernel-based methods are achieved by
using a characteristic kernel tailored to the periodic domain, $k(x, y) = \prod_{i=1}^{2} 1 / (1 - 2\theta \cos(x_i - y_i) + \theta^2)$,
with $\theta = 0.9$ tuned using the 5-fold cross validation procedure. The results reported in Table 2 show
much smaller values of ARE for RMMD than for KFDA and MMD. Figure 5 shows the detailed
results of the type II error for RMMD, MMD, and KFDA based on different sample sizes $n$ and different
frequencies $\omega$. As displayed in Figure 5, RMMD obtains the robust result of zero type II error for 100
samples over all different frequencies. In contrast, KFDA and MMD need much larger samples for the more
difficult cases with larger $\omega$ to obtain a power of one.



Fig. 2. Type I error as a function of the sample size n.



TABLE II

The ARE obtained on the periodic data, by applying RMMD with $\kappa = 1$, and $\theta = 0.9$ in periodic kernels, and KFDA
with $\gamma = 10^{-1}$.

              e_{MMD,RMMD}   e_{MMD,KFDA}   e_{KFDA,RMMD}
Periodic1     0.71           0.75           0.93
Periodic2     0.75           1              0.75
Puni1         0.11           0.78           0.14
Puni2         0.09           0.82           0.11
Puni3         0.09           0.82           0.11
Puni4         0.08           0.85           0.09
Puni5         0.07           0.88           0.06
Puni6         0.05           0.81           0.06




B. MNIST, Covertype, and Flare-Solar Datasets 

Moving from synthetic data to standard benchmarks, we tested our method on three datasets: 1) the
MNIST dataset of handwritten digits (LibSVM library: 10 classes, 5000 data points, and 784 dimensions);
2) the Covertype dataset of forest cover types (LibSVM library: 7 classes, 1400 instances, and 54
dimensions); 3) the Flare-Solar dataset (mldata.org: 2 classes, 100 instances, 10 dimensions). We compare
the performance of RMMD with $\kappa = 1$, KFDA with $\gamma = 10^{-1}$, and MMD, using the pairwise approach and
testing for differences between the distributions of the classes; see Table 1. We average the results over 10
runs. The family-wise level is set to $\alpha = 0.05$ (resulting in $\alpha' = 0.0011$, $\alpha' = 0.0024$ and $\alpha' = 0.05$ for each
individual comparison for the MNIST, Covertype and Flare-Solar datasets, respectively). The RMMD-based
test achieves higher power than the other methods (see Table 1).

C. Electroencephalography Data 

We recorded EEG from four subjects performing a visual task. A checkerboard was presented in the
subject's left visual field. We refer to [25] for details on data collection and preprocessing. In our learning
task, for each subject we have 64 signal distributions assigned to 64 electrodes. The data contain 360
instances of a 200-dimensional feature vector for each distribution. The goal of hypothesis testing is to
disambiguate signals recorded from electrodes corresponding to early visual cortex from the rest. This
is difficult because of the low signal-to-noise ratio and the similarity of the patterns of all electrodes.
Moreover, the high number of electrodes makes this experiment a good candidate to assess the multiple
comparison part of our method. In the one-vs-all approach the normalized distribution of each electrode
is compared to the normalized combined distribution of the other 63 electrodes. RMMD with $\kappa = 1$
and a Gaussian kernel is used as our hypothesis test. The parameter $\sigma$ of the Gaussian kernel is set to the
median distance between data points. The results of our hypothesis test reject the null hypothesis and confirm
the dissimilarity of distributions in 63 electrodes. The results of the pairwise approach with RMMD and
MMD are depicted in Figure 6.

Fig. 3. Type II error for different sample sizes n. On the left, the results with periodic kernel 1, and on the right, the results
with periodic kernel 2.

Fig. 4. The probability density function of Puni1 with $\omega = 1$ on the left and the probability density function of Puni6 with $\omega = 6$ on
the right. As $\omega$ increases the probability density function looks more similar to the uniform distribution, and the discrimination of $P$ and
the perturbed distribution becomes more difficult for the test statistics.

Neuroscientists usually subjectively assess the results obtained from imaging techniques and inferred
from machine learning. For instance, in the current experiment the expectation is that electrodes in
region A1 are categorized together by means of EEG imaging techniques and multiple comparisons.
But electrodes of other areas (such as A2 and A3, see Figure 7) can be confused as belonging to A1 due to
the high noise. Figure 7 describes the categorization of the electrodes. We assess our results quantitatively
by means of False Discovery Rates (FDR), using the following FDRs to compare the results of RMMD
to those of MMD:

$$
\mathrm{FDR}_0 = \frac{\text{no. of electrodes categorized for the visual task in } A2 \cup A3 \cup B}{U},
$$

$$
\mathrm{FDR}_1 = \frac{\text{no. of electrodes categorized for the visual task in } A3 \cup B}{U},
$$

$$
\mathrm{FDR}_2 = \frac{\text{no. of electrodes categorized for the visual task in } B}{U},
$$

where $U$ is the total number of electrodes categorized for the task. The results are depicted in Figure 7.
RMMD obtained more robust and better results than MMD, with smaller FDRs.

Fig. 5. Type II error for different sample sizes n (axis: no. of samples) and different frequencies $\omega$ (axis: frequency). On the left,
different sample sizes n for different frequencies $\omega$ are shown; in the middle, the results for the KFDA-based test; and on the right,
the results for the MMD-based test.

Fig. 6. The results of RMMD and MMD as hypothesis tests on the EEG data recorded from 64 electrodes per subject in the top row and
the bottom row, respectively. Categorized electrodes recognized by the two methods as related to the visual task are colored. Panels:
a) the first subject, b) the second subject, c) the third subject, d) the fourth subject.

Fig. 7. The reference image of the EEG electrodes is shown on the left. We categorized electrodes into four groups as follows: A1, the
electrodes corresponding to visual cortex in the region of interest; A2, the peripheral electrodes that can be wrongly detected due to noise;
A3, the electrodes in the left visual cortex often detected due to noise or interrelation between brain areas; and B, all the remaining electrodes.
On the right, the results of RMMD and MMD are quantitatively compared based on the FDRs defined in the text. The smallest and most
robust FDRs are obtained by RMMD.



V. Conclusion 

Our novel regularized maximum mean discrepancy (RMMD) is a kernel-based test statistic generalizing
the MMD test. We proved that RMMD has higher power than MMD and KFDA; power consistency is obtained at a
higher rate. Power control makes RMMD a good hypothesis test for multiple comparisons, especially for
the crucial case of small sample sizes. In contrast to KFDA and MMD, the convergence of RMMD-based
test statistics to the normal distribution under the null and alternative hypotheses yields fast and
straightforward RMMD estimates. Experiments with gold-standard benchmarks (MNIST, Covertype and
Flare-Solar datasets) and with EEG data yield state-of-the-art results.